Basic theory of Feed-forward Neural Networks
Claudio Mirabello
Resources
● MIT lectures on Deep Learning (http://introtodeeplearning.com/)
● TensorFlow Playground (https://playground.tensorflow.org)
● Keras Docs (https://keras.io)
This is a neuron
(image: wikipedia.org)
This is a perceptron (1958)
Mark I Perceptron machine
(images: wikipedia.org; MIT "Intro to Deep Learning")
Perceptrons caused excitement
"the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
— The New York Times
Perceptrons can only learn linearly separable classes
(Minsky and Papert, 1969)

Perceptrons can only learn linearly separable classes
(demo: TensorFlow Playground)
But sometimes you want to model non-linear functions
How do we make this non-linear then?
Two ingredients to add
1: Differentiable, non-linear activation functions
Common activation functions
Special case: softmax
● Used in classification problems
● Given k classes, it decides which one is more likely
● One output per class, each output is assigned a probability from 0 to 1
● The sum of probabilities for all outputs is 1
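The properties above can be seen in a minimal softmax sketch in Python (the example logits are my own, for illustration):

```python
import math

def softmax(z):
    # Subtract the max for numerical stability (the result is unchanged).
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Three "classes": the largest input gets the largest probability,
# and the probabilities sum to 1.
probs = softmax([2.0, 1.0, 0.1])
```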
Wait a second, the perceptron already has a non-linear (step) activation function!
This is a perceptron
2: Multi-layer Perceptron (1986)
Now we're getting somewhere
Why stop at one hidden layer?
Deep Networks are simply NNs with multiple hidden layers
(demo: https://playground.tensorflow.org)
Let's review:
- Perceptron
- XOR problem
- Activations
- Multi-layer perceptron
How do we decide which weights are optimal?
● A linear regressor's weights (coefficients) are calculated in closed form
● This can't be done if you have hidden layers and non-linear activations
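To make the contrast concrete, here is a sketch of a closed-form fit for a one-feature linear regressor without a bias term (the data values are made up for illustration):

```python
# Closed-form least squares for y ≈ w * x (single feature, no bias):
# w = sum(x*y) / sum(x*x). One formula, no iteration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
# Hidden layers and non-linear activations take this away:
# no such closed form exists for an MLP.
```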
Lower loss => better predictions
Minimizing loss
Gradient descent
Activation functions have to be differentiable!
The learning rate η
Backpropagation
Backpropagation example, step by step:
https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
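The gradient descent update rule w ← w – η * dJ/dw can be sketched on a toy one-variable loss J(w) = (w – 2)² (my example, not from the slides):

```python
def grad(w):
    # dJ/dw for J(w) = (w - 2)**2
    return 2.0 * (w - 2.0)

eta = 0.1          # learning rate η
w = 0.0            # arbitrary starting point
for _ in range(100):
    w -= eta * grad(w)   # step downhill, scaled by η
# w converges toward the minimum at 2.0
```

Too small an η means many iterations to converge; too large an η can overshoot the minimum.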
Practical example
Simple network:
● Two inputs [x1, x2]
● Two weights [w1, w2]
● No bias
● Activation function g( )
● One output ŷ
● One label y
● Loss function 𝓛( )
● Weight-dependent error J(W)
(diagram: x1 and x2 feed into Σ through w1 and w2; Σ passes through g to give ŷ; ŷ is compared with y by 𝓛 to give J)
1. Forward pass
x1 = 0.50, x2 = 0.51
w1 = 0.35, w2 = 0.40
Σ = w1 * x1 + w2 * x2 = 0.35 * 0.50 + 0.40 * 0.51 = 0.38
ŷ = g(Σ) = 1/(1+e^-Σ) = 0.59
2. Calculate Loss
y = 0.10
J(W) = 𝓛(y, ŷ) = ½ * (y – ŷ)² = ½ * (0.10 – 0.59)² = 0.12
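The forward pass and loss above can be checked with a few lines of Python (values taken from the slides):

```python
import math

x1, x2 = 0.50, 0.51
w1, w2 = 0.35, 0.40
y = 0.10

s = w1 * x1 + w2 * x2                 # weighted sum Σ ≈ 0.38
y_hat = 1.0 / (1.0 + math.exp(-s))    # sigmoid g(Σ) ≈ 0.59
loss = 0.5 * (y - y_hat) ** 2         # squared-error loss ≈ 0.12
```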
3. Backpropagate the error
J(W) = 𝓛(g(X * W))
∂J(W)/∂w1 = ∂J(W)/∂ŷ * ∂ŷ/∂Σ * ∂Σ/∂w1
            (loss)      (activation)  (weight)
Derivatives calculated in this order: loss, activation, weight
3. Backpropagate the error: loss
∂J(W)/∂w1 = ∂J(W)/∂ŷ * ∂ŷ/∂Σ * ∂Σ/∂w1
            (loss)      (activation)  (weight)
J(W) = 𝓛(y, ŷ) = ½ * (y – ŷ)²
∂J(W)/∂ŷ = 2 * ½ * (y – ŷ) * -1 = -y + ŷ = -0.10 + 0.59 = 0.49
3. Backpropagate the error: activation
∂J(W)/∂w1 = 0.49 * ∂ŷ/∂Σ * ∂Σ/∂w1
            (loss)  (activation)  (weight)
ŷ = g(Σ) = 1/(1+e^-Σ)
∂ŷ/∂Σ = 1/(1+e^-Σ) * (1 – 1/(1+e^-Σ)) = ŷ * (1 – ŷ) = 0.59 * 0.41 = 0.24
3. Backpropagate the error: weight
∂J(W)/∂w1 = 0.49 * 0.24 * ∂Σ/∂w1
            (loss)  (activation)  (weight)
Σ = X * W = x1 * w1 + x2 * w2
∂Σ/∂w1 = x1 + 0 = 0.50
4. Weight update
∂J(W)/∂w1 = 0.49 * 0.24 * 0.50 = 0.06 (gradient)
            (loss)  (activation)  (weight)
w1' = w1 – η * 0.06 = 0.35 – 0.06 = 0.29 (with η = 1)
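The whole chain for w1 can be verified numerically (η = 1, matching the update on the slide):

```python
import math

x1, x2 = 0.50, 0.51
w1, w2 = 0.35, 0.40
y, eta = 0.10, 1.0   # learning rate η = 1 matches the slides' update

s = w1 * x1 + w2 * x2
y_hat = 1.0 / (1.0 + math.exp(-s))

dJ_dyhat = y_hat - y               # loss term ≈ 0.49
dyhat_ds = y_hat * (1.0 - y_hat)   # sigmoid derivative ≈ 0.24
ds_dw1 = x1                        # weight term = 0.50

grad_w1 = dJ_dyhat * dyhat_ds * ds_dw1   # chain rule ≈ 0.06
w1_new = w1 - eta * grad_w1              # updated weight ≈ 0.29
```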
Exercise
Can you calculate the weight update for w2? How many new gradients do you need to calculate?
What is the new predicted output? Has the error gone down?
What if I had another layer before this one?
(diagram values after the update: w1 = 0.29, w2 = 0.34, Σ = 0.32, ŷ = 0.58)
4. Weight update
∂J(W)/∂w1 = 0.49 * 0.24 * 0.50 = 0.06 (gradient)
            (loss)  (activation)  (weight)
w1' = w1 – η * 0.06 = 0.35 – 0.06 = 0.29
w2' = w2 – η * 0.06 = 0.40 – 0.06 = 0.34
(only one new gradient is needed: ∂Σ/∂w2 = x2 = 0.51, so the w2 gradient also rounds to 0.06)
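As a check on the exercise, re-running the forward pass with the updated weights (values from the slide) shows the error has indeed gone down:

```python
import math

x1, x2 = 0.50, 0.51
w1, w2 = 0.29, 0.34   # weights after one gradient descent step
y = 0.10

s = w1 * x1 + w2 * x2                 # new Σ ≈ 0.32 (was 0.38)
y_hat = 1.0 / (1.0 + math.exp(-s))    # new ŷ ≈ 0.58, closer to y = 0.10
loss = 0.5 * (y - y_hat) ** 2         # down from ≈ 0.12
```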
Gradient vanishing
∂J/∂W1 = ∂J/∂ŷ * ∂ŷ/∂Σ_N * ∂Σ_N/∂W_N * ... * ∂Z_k/∂Σ_k * ∂Σ_k/∂W_k * … * ∂Z_1/∂Σ_1 * ∂Σ_1/∂W_1
What happens if we backpropagate on a network with many (N > k > 1) hidden layers?
Gradient vanishing
∂J/∂W1 = ∂J/∂ŷ * ∂ŷ/∂Σ_N * ∂Σ_N/∂W_N * ... * ∂Z_k/∂Σ_k * ∂Σ_k/∂W_k * … * ∂Z_1/∂Σ_1 * ∂Σ_1/∂W_1 = O(10^-N)
initial w1 = 0.5
optimal w1 = -0.2
5-layer gradient ~ 0.00001
How many iterations do we need to get from 0.5 to -0.2?
These are all "zero-point-something" factors multiplied by each other, so the gradient shrinks by orders of magnitude as we go back through the layers, until it is so small that the network gets stuck.
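A sketch of why the product shrinks: the derivative of the sigmoid is at most 0.25 (at Σ = 0), so each extra layer multiplies the gradient by another small factor (the layer count and pre-activation value here are my own illustration):

```python
import math

def sigmoid_deriv(s):
    sig = 1.0 / (1.0 + math.exp(-s))
    return sig * (1.0 - sig)   # peaks at 0.25 when s = 0

# One sigmoid-derivative factor per layer in the chain rule;
# even the best case (s = 0 everywhere) shrinks fast.
grad_factor = 1.0
for _ in range(10):            # 10 hidden layers
    grad_factor *= sigmoid_deriv(0.0)
# 0.25**10 ≈ 1e-06: the gradient reaching layer 1 is vanishingly small.
```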